Update turbomind modeling infrastructure #4557
Open
lzhangzz wants to merge 13 commits into InternLM:main from
Conversation
…t loading, and model loader 674 squashed commits:
- Reorganize turbomind directory structure
- Refactor weight loading to support heterogeneous weight data types
- Add a WeightFormat enum
- Replace BaseOutputModel/TextModelLoader with a unified ModelLoader
- Eliminate data_format threading from Linear
- Remove dead code
Contributor
Pull request overview
This PR refactors TurboMind’s modeling + conversion stack by replacing the legacy “deploy/source_model + config dataclasses” pipeline with a spec/builder-driven module system, adding a C++ module registry and new weight-module types, and updating engine/model code to consume the new weight tree.
Changes:
- Introduces a registry-backed C++ `core::Module` infrastructure (plus `DataFormat`) and new modular weight classes (Linear/Norm/Attention/FFN/MoE/DeltaNet/ModelRoot/ModelWeight).
- Reworks the Python-side TurboMind converter to use `TextModelSpec` plus builders and a model loader, and removes the legacy `lmdeploy.turbomind.deploy` pipeline.
- Updates engine/model runtime plumbing (TurboMind API, Engine/SequenceManager, llama layers) to use the new module/weight tree.
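The registry-backed module infrastructure described above can be sketched in Python. All names here (`register_module`, `create_module`, the registry dict) are illustrative stand-ins for the PR's C++ `core::Module` registry, not its actual API:

```python
# Minimal sketch of a name-keyed module registry, analogous in spirit to the
# C++ module registry this PR adds. Names are illustrative, not the PR's API.
_REGISTRY = {}

def register_module(name):
    """Class decorator that records a module type under `name`."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

def create_module(name, **kwargs):
    """Instantiate a registered module type by name."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise ValueError(f'unknown module type: {name!r}') from None
    return cls(**kwargs)

@register_module('norm')
class NormWeight:
    def __init__(self, dim):
        self.dim = dim
```

A registry like this is what makes the `--whole-archive` link flag in `src/turbomind/python/CMakeLists.txt` necessary on the C++ side: static registrar objects must not be dropped by the linker, or their module types never get registered.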
Reviewed changes
Copilot reviewed 131 out of 131 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/test_lmdeploy/test_turbomind/test_converter.py | Removes legacy converter tests; leaves a remaining test that still references removed legacy modules. |
| tests/test_lmdeploy/test_turbomind/test_compressed_tensors.py | Adjusts compressed-tensors tests but still imports removed legacy deploy modules. |
| tests/test_lmdeploy/test_converter.py | Adds tests for _deep_merge plus a logging capture fixture. |
| src/turbomind/utils/memory_utils.h | Declares dtype-cast kernel + in-place ensure-float-dtype helper. |
| src/turbomind/utils/memory_utils.cu | Implements dtype casting and EnsureFloatDtype. |
| src/turbomind/turbomind.h | Updates TurboMind API to accept EngineConfig and expose module roots + TP ranks. |
| src/turbomind/python/CMakeLists.txt | Ensures static registrars are linked into the Python extension via --whole-archive. |
| src/turbomind/models/output_processor.h | Refactors ctor signature to avoid ModelParam dependency. |
| src/turbomind/models/output_processor.cc | Implements updated OutputProcessor ctor signature. |
| src/turbomind/models/norm_weight.h | Adds a NormWeight module type. |
| src/turbomind/models/norm_weight.cc | Registers and prepares NormWeight (dtype ensure). |
| src/turbomind/models/moe_weight.h | Adds a modular MoeWeight definition/config. |
| src/turbomind/models/moe_weight.cc | Implements MoE expert linking into a fused block view. |
| src/turbomind/models/model_weight.h | Adds root ModelWeight module for full weight tree. |
| src/turbomind/models/model_weight.cc | Implements ModelWeight prepare/verify + derived metadata. |
| src/turbomind/models/model_root.h | Adds ModelRoot sentinel for stream/allocator ownership. |
| src/turbomind/models/model_root.cc | Implements ModelRoot runtime context + prepare checks. |
| src/turbomind/models/llama/unified_decoder.h | Updates decoder to consume ModelWeight/DecoderLayerWeight. |
| src/turbomind/models/llama/unified_attention_layer.h | Refactors attention layer to use new AttentionWeight and rope config. |
| src/turbomind/models/llama/moe_ffn_layer.h | Refactors MoE FFN layer to use MoeWeight. |
| src/turbomind/models/llama/llama_rope.h | Moves rope param helpers out (now in AttentionWeight impl). |
| src/turbomind/models/llama/llama_params.h | Replaces model/attn/moe params with EngineConfig-based EngineParam. |
| src/turbomind/models/llama/SequenceManager.h | Updates ctor signature to explicit scalar params (no ModelParam). |
| src/turbomind/models/llama/SequenceManager.cc | Implements updated SequenceManager state sizing and cache layout. |
| src/turbomind/models/llama/LlamaWeight.h | Removes old monolithic LlamaWeight. |
| src/turbomind/models/llama/LlamaWeight.cc | Removes old monolithic LlamaWeight implementation. |
| src/turbomind/models/llama/LlamaLinear.h | Switches linear ops to new LinearWeight. |
| src/turbomind/models/llama/LlamaLinear.cu | Implements GEMM path using LinearWeight formats/descriptors. |
| src/turbomind/models/llama/LlamaFfnLayer.h | Refactors FFN layer to consume FfnWeight. |
| src/turbomind/models/llama/LlamaFfnLayer.cc | Updates FFN forward path for new weight module layout. |
| src/turbomind/models/llama/LlamaDenseWeight.h | Removes old dense/attention/ffn weight structs. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.h | Removes old llama-specific decoder layer weight. |
| src/turbomind/models/llama/LlamaDecoderLayerWeight.cc | Removes old llama-specific decoder layer weight impl. |
| src/turbomind/models/llama/GatedDeltaNetWeight.h | Removes old DeltaNet weight module. |
| src/turbomind/models/llama/GatedDeltaNetWeight.cc | Removes old DeltaNet weight module impl. |
| src/turbomind/models/llama/GatedDeltaNetLayer.h | Updates GDN layer to consume DeltaNetWeight. |
| src/turbomind/models/llama/CMakeLists.txt | Adjusts llama static lib sources (legacy pieces removed). |
| src/turbomind/models/linear_weight.h | Adds new LinearWeight module and format helpers. |
| src/turbomind/models/language_model.h | Switches LanguageModel to accept ModelWeight. |
| src/turbomind/models/input_processor.h | Refactors ctor to avoid ModelParam dependency. |
| src/turbomind/models/input_processor.cc | Implements updated ctor; allocates embed buffers from explicit dims/dtype. |
| src/turbomind/models/ffn_weight.h | Adds FfnWeight module and config. |
| src/turbomind/models/ffn_weight.cc | Implements FfnWeight::prepare (epilogue + grouped flag propagation). |
| src/turbomind/models/delta_net_weight.h | Adds DeltaNetWeight module and config. |
| src/turbomind/models/delta_net_weight.cc | Implements DeltaNetWeight::prepare dtype enforcement. |
| src/turbomind/models/decoder_layer_weight.h | Adds architecture-independent DecoderLayerWeight composite. |
| src/turbomind/models/decoder_layer_weight.cc | Implements verify rules and registers the module. |
| src/turbomind/models/attention_weight.h | Adds AttentionWeight module and embedded RopeConfig. |
| src/turbomind/models/attention_weight.cc | Implements rope kernel param init and registers AttentionWeight. |
| src/turbomind/models/CMakeLists.txt | Adds new module sources to models library; removes legacy llama weight sources. |
| src/turbomind/kernels/quantization.cu | Makes QuantizeSymm dtype-dispatched. |
| src/turbomind/kernels/gemm/convert_v3.cu | Comment tweak for “no quantization” case. |
| src/turbomind/kernels/gemm/CMakeLists.txt | Comments out legacy gemm test executables. |
| src/turbomind/engine/engine_config.h | Introduces EngineConfig struct (X-macro fields). |
| src/turbomind/engine/engine.h | Updates Engine ctor signature (now takes ModelWeight). |
| src/turbomind/engine/engine.cc | Refactors Engine to derive runtime fields from ModelWeight rather than ModelParam. |
| src/turbomind/core/test_data_format.cc | Adds Catch2 tests for DataFormat/ResolveLinearWeightFormat. |
| src/turbomind/core/registry.h | Adds module type registry + registration macro. |
| src/turbomind/core/registry.cc | Implements module registry. |
| src/turbomind/core/module.cc | Rewrites module base + ModuleList implementation and hooks up registry-based creation. |
| src/turbomind/core/data_format.h | Adds DataFormat + quant-param descriptors and helpers. |
| src/turbomind/core/data_format.cc | Implements DataFormat logic and ResolveLinearWeightFormat. |
| src/turbomind/core/CMakeLists.txt | Builds new core sources + adds data_format test. |
| src/turbomind/CMakeLists.txt | Adjusts turbomind link libs (removes yaml-cpp). |
| scripts/test_turbomind_model.py | Adds a CLI smoke-test script for TurboMind models. |
| lmdeploy/turbomind/supported_models.py | Narrows/updates supported arch mapping and simplifies checks. |
| lmdeploy/turbomind/spec.py | Adds TextModelSpec base (HF parsing → C++ configs + weight commits). |
| lmdeploy/turbomind/models/base.py | Introduces new INPUT_MODELS registry for spec classes. |
| lmdeploy/turbomind/models/__init__.py | Imports/registers available specs. |
| lmdeploy/turbomind/model_loader.py | Adds ModelLoader to bind runtime handles and load weights into TM. |
| lmdeploy/turbomind/loader.py | Adds all_items() API to loaders for spec-driven loading. |
| lmdeploy/turbomind/linear.py | Adds Linear bundle type and padding/concat helpers. |
| lmdeploy/turbomind/deploy/target_model/fp.py | Removes legacy deploy output model stub. |
| lmdeploy/turbomind/deploy/target_model/__init__.py | Removes legacy deploy target_model exports. |
| lmdeploy/turbomind/deploy/source_model/xcomposer2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/molmo.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/mixtral.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/minicpmv.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/llava.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internvl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/internlm2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/gpt_oss.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4_moe_lite.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/glm4.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek_vl.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/deepseek2.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/base.py | Removes legacy deploy registries/base classes. |
| lmdeploy/turbomind/deploy/source_model/baichuan.py | Removes legacy deploy reader/model. |
| lmdeploy/turbomind/deploy/source_model/__init__.py | Removes legacy deploy source_model imports. |
| lmdeploy/turbomind/deploy/policy.py | Removes legacy tensor processing policy helpers. |
| lmdeploy/turbomind/deploy/parameter.py | Removes legacy parameter export utilities. |
| lmdeploy/turbomind/deploy/config.py | Removes legacy turbomind model config dataclasses. |
| lmdeploy/turbomind/deploy/__init__.py | Removes legacy deploy package init. |
| lmdeploy/turbomind/builders/norm.py | Adds builder for Norm module commits. |
| lmdeploy/turbomind/builders/moe.py | Adds builder for MoE non-expert params and gate commits. |
| lmdeploy/turbomind/builders/module_list.py | Adds builder for ModuleList container commits. |
| lmdeploy/turbomind/builders/mla.py | Adds MLA fold/pad pipeline + builder. |
| lmdeploy/turbomind/builders/deltanet.py | Adds DeltaNet fusion helpers + builder. |
| lmdeploy/turbomind/builders/decoder_layer.py | Adds a decoder-layer container builder. |
| lmdeploy/turbomind/builders/attention.py | Adds attention fusion pipeline + builder. |
| lmdeploy/turbomind/builders/__init__.py | Exposes builder APIs. |
| lmdeploy/messages.py | Changes Response.__repr__ formatting. |
| lmdeploy/archs.py | Changes ImportError handling in backend auto-selection. |
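The new `tests/test_lmdeploy/test_converter.py` covers a `_deep_merge` helper. A recursive dict merge of that general shape might look like the following sketch (the name and exact semantics are assumptions, not the PR's implementation):

```python
def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`.

    Nested dicts are merged key-by-key; any other value in `override`
    replaces the corresponding value in `base`. Neither input is mutated.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged
```

A helper like this is the usual way to layer a partial user config over a full default config without clobbering whole nested sections.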
- Move dequant/transform utilities from _base.py into linear.py as the canonical home for all Linear operations
- Unify _ensure_compatible_formats and dequant_mixed into a single dequant_mixed function that triggers on any format diversity
- Drop the 'Spec' suffix from all turbomind model classes and files (TextModelSpec → TextModel, Qwen3TextSpec → Qwen3TextModel, etc.)
- Extract TextModelBuilder from _base.py into builders/text_model.py
- Move model-specific qk_norm from TextModel to Qwen3 and Qwen3.5
- Fix .gitignore typo (trubomind → turbomind)
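The "triggers on any format diversity" behavior of the unified `dequant_mixed` can be sketched as follows. The function shape and the `dequant` callback are illustrative assumptions, not the PR's actual signatures:

```python
def dequant_mixed(tensors, formats, dequant):
    """Sketch: if a weight bundle mixes formats, dequantize everything to
    a common floating format; a homogeneous bundle passes through untouched.

    `dequant` is a caller-supplied callback (tensor, format) -> tensor.
    """
    if len(set(formats)) <= 1:
        return list(tensors)  # all the same format: nothing to do
    return [dequant(t, f) for t, f in zip(tensors, formats)]
```

The point of triggering on *any* diversity, rather than on specific format pairs, is that downstream fusion (concatenating q/k/v or gate/up projections) only has to handle one case: either all inputs share a format, or all have been lowered to floats.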
Collaborator
For the /nvme4/huggingface_hub/hub/models--Qwen--Qwen3.5-2B/snapshots/15852e8c16360a2fea060d615a32b45270f8a8fc/ model, the results differ from those of the main branch.

[Attached output comparison: this branch vs. main branch]
…anup Align all turbomind source models with Qwen3 conventions:
- Drop engine_cfg from model signatures, wire data_type via Context
- Add Context, ParallelGroup, make_moe_config, make_mla_config helpers
- Collapse make_*_config functions by removing per-function data_type
- Remove dead fields from C++ configs (has_bias, hidden_dim, etc.)
- Remove _layer_pattern, _embed_key, _norm_key from all models
- Unify FFN padding with group-based pad/round_up helpers
- Add TP padding for block-quantized formats and GEMM K-alignment
- Remove dead code: _pad_1d, _norm, pad_in_dim, _softmax_scale
- Add InternVL3.5, InternLM2/3, Llama turbomind support
- Rename fused_moe to is_expert, align Python/C++ config fields
- Use direct HF config access, Transformers type hints, all-params loader
- Clean up imports, docstrings, formatting across all model files
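The group-based pad/round_up helpers mentioned above can be sketched like this. Both function names and the exact padding rule are assumptions for illustration; the intent is that each tensor-parallel shard ends up holding a whole number of quantization groups:

```python
def round_up(x, align):
    """Round x up to the nearest multiple of `align`."""
    return (x + align - 1) // align * align

def pad_to_group(dim, group_size, tp):
    """Sketch: pad a weight dimension so that, after splitting across `tp`
    ranks, each shard covers complete quantization groups of `group_size`.
    """
    return round_up(dim, group_size * tp)
```

For example, a hidden dimension of 100 with 32-wide quant groups and TP=2 would be padded to 128, so each rank gets 64 columns (two whole groups) instead of a group straddling the shard boundary.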
The raise in archs.py was a debugging leftover. The repr in messages.py needs !r to properly escape control characters.
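The `!r` point is easy to demonstrate with a minimal sketch (this `Response` class is a simplified stand-in, not lmdeploy's actual one):

```python
class Response:
    """Simplified stand-in to show why repr formatting needs !r."""
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # {self.text!r} applies repr() to the field, so control characters
        # like newlines are shown as escapes instead of being rendered,
        # keeping the whole repr on one unambiguous line.
        return f'Response(text={self.text!r})'
```

Without `!r`, a response containing `\n` or `\r` would break the repr across lines (or let embedded quotes masquerade as delimiters), which matters anywhere reprs end up in logs.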
Replace turbomind's flat string-prefix model loading with a typed Checkpoint (storage backend) and Prefix (path navigation) pair, making source models stateless and decoupling topology from storage.

Core infrastructure:
- Add a Checkpoint ABC with SafetensorsCheckpoint and PytorchCheckpoint subclasses
- Add Prefix for typed checkpoint path navigation (+, slices, get, pop)
- Add .cuda() device transfer and .pop() for single-use weights to Checkpoint
- Replace Prefix.chunks with Prefix.slices for cleaner layer iteration
- Thread the index parameter through the resolver, dropping post-hoc indexing

Architecture migration:
- Bridge ModelLoader.export to Checkpoint/Prefix via a per-class _uses_prefix flag
- Make WeightFormatResolver accept Prefix alongside the legacy (params, prefix) tuple
- Migrate all 8 architectures to Checkpoint/Prefix: llama, qwen2, qwen3, internvl3_5, internlm2, glm4_moe_lite, qwen3_5, gpt_oss
- Drop legacy dict-based loading after migration
- Inline shard walking into checkpoint.py and strip loader.py
- Remove the layer_progress helper and unused imports

Norm refactoring:
- Change norm() to accept Prefix instead of a raw tensor
- Use pop() for single-use norm weights across all architectures
- Update qwen3_5 norm() calls to use Prefix with transforms

Packed expert index:
- Add an index parameter to Checkpoint.get/pop for packed expert weight access
- Fix get() vs pop() semantics for packed expert weights

Style:
- Reflow long lines, fix lint violations (UP037, imports), align whitespace
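The Prefix half of the pair can be sketched as follows. The method set (`+` for appending a segment, `slices` for iterating indexed children) mirrors the description above, but the implementation details are illustrative assumptions, not the PR's code:

```python
class Prefix:
    """Sketch of typed checkpoint path navigation.

    `+` appends a path segment and returns a new Prefix (immutable),
    and slices(n) yields one child Prefix per indexed element, e.g.
    one per decoder layer. Illustrative only.
    """
    def __init__(self, parts=()):
        self.parts = tuple(parts)

    def __add__(self, segment):
        return Prefix(self.parts + (str(segment),))

    def slices(self, n):
        for i in range(n):
            yield self + i

    def __str__(self):
        return '.'.join(self.parts)
```

Making prefixes immutable values rather than mutable strings is what lets source models stay stateless: a model describes its topology as prefix arithmetic, and the Checkpoint alone knows how those paths map onto safetensors or pickle shards.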
- Resolve conflicts by keeping our refactored architecture
- Thread trust_remote_code through from_pretrained → __init__ → _from_hf → is_supported → get_tm_config → get_model_arch
- Add an is_cublas_grouped check to _should_fuse_silu: disable fused SiLU for bf16 MoE on SM100+ GPUs (CublasGroupedKernel limitation)
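The fused-SiLU gating described in the last bullet amounts to a capability check; a sketch under assumed names and conditions (the real `_should_fuse_silu` takes different inputs):

```python
def should_fuse_silu(dtype, is_moe, sm_version, is_cublas_grouped):
    """Sketch of the gating described above: fused SiLU is disabled for
    bf16 MoE on SM100+ when the cuBLAS grouped-GEMM path is in use.
    Names and exact conditions are illustrative assumptions.
    """
    if is_moe and dtype == 'bfloat16' and sm_version >= 100 and is_cublas_grouped:
        return False  # CublasGroupedKernel limitation on this path
    return True
```

Dense layers, non-bf16 dtypes, and pre-SM100 GPUs keep the fused activation in this sketch; only the one problematic combination falls back.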